Learning Knowledge Bases for Information Extraction from Multiple Text Based Web Sites

Authors

  • Xiaoying Gao
  • Mengjie Zhang
Abstract

We describe a learning approach to automatically building knowledge bases for information extraction from multiple text-based web pages. A frame-based representation is introduced to represent domain knowledge as knowledge unit frames. A frame learning algorithm is developed to automatically learn knowledge unit frames from training examples. Some training examples can be obtained by automatically parsing a number of tabular web pages in the same domain, which greatly reduces the time-consuming manual work. The approach was investigated on ten web sites of real estate advertisements and car advertisements; nearly all the information was successfully extracted, with very few false alarms. These results suggest that both the knowledge unit frame representation and the frame learning algorithm work well, that a domain-specific knowledge base can be learned from training examples, and that such a knowledge base can be used for information extraction from flexible, text-based, semi-structured web pages on multiple web sites.

Similar resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Dependency-Based Open Information Extraction

Building shallow semantic representations from text corpora is the first step to perform more complex tasks such as text entailment, enrichment of knowledge bases, or question answering. Open Information Extraction (OIE) is a recent unsupervised strategy to extract billions of basic assertions from massive corpora, which can be considered as being a shallow semantic representation of those corp...

Constructing Biological Knowledge Bases by Extracting

Recently, there has been much effort in making databases for molecular biology more accessible and interoperable. However, information in text form, such as MEDLINE records, remains a greatly underutilized source of biological information. We have begun a research effort aimed at automatically mapping information from text sources into structured representations, such as knowledge bases. Our app...

Inference Over the Web

Stefan Schoenmackers. Co-Chairs of the Supervisory Committee: Professor Oren Etzioni (Computer Science and Engineering) and Professor Daniel S. Weld (Computer Science and Engineering). The World Wide Web contains vast amounts of text written about nearly any topic imaginable. Recent work in Information Extraction has sought to recover the information stated in this text, aggregatin...

Using Semantics and Statistics to Turn Data into Knowledge

A growing body of research focuses on extracting knowledge from text such as news reports, encyclopedic articles, and scholarly research in specialized domains. Much of this data is freely available on the World Wide Web and harnessing the knowledge contained in millions of web documents remains a problem of particular interest. The scale and diversity of this content pose a formi...

Publication date: 2003